Biclustering Sparse Binary Genomic Data
Authors
Abstract
Genomic datasets often consist of large, sparse, binary data matrices. In such a dataset, one is often interested in finding contiguous blocks that (mostly) contain ones. This is a biclustering problem, and while many algorithms have been proposed for gene expression data, only two have been proposed that specifically deal with binary matrices. None of the gene expression biclustering algorithms can handle the large number of zeros in sparse binary matrices, and the two proposed binary algorithms failed to produce meaningful results. In this article, we present a new algorithm that is able to extract biclusters from sparse, binary datasets. A powerful feature is that it can detect biclusters with different numbers of rows and columns, ranging from many rows and few columns to few rows and many columns, and it allows the user to guide the search towards biclusters of specific dimensions. When applying our algorithm to an input matrix derived from TRANSFAC, we find transcription factors with distinctly dissimilar binding motifs but a clear set of common targets that are significantly enriched for GO categories.
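To make the notion of a block that "(mostly) contains ones" concrete, the following is a minimal sketch of how the density of a candidate bicluster could be measured against the overall density of a sparse binary matrix. It is not the algorithm presented in the article; the function name `bicluster_density` and the toy matrix are hypothetical and exist only for illustration.

```python
def bicluster_density(matrix, rows, cols):
    """Return the fraction of ones in the submatrix selected by rows and cols.

    Hypothetical helper for illustration only, not the article's method.
    """
    if len(rows) == 0 or len(cols) == 0:
        raise ValueError("rows and cols must be non-empty")
    ones = sum(matrix[r][c] for r in rows for c in cols)
    return ones / (len(rows) * len(cols))


# Toy 5 x 6 sparse binary matrix (made-up data).
toy = [
    [1, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
]

# Candidate bicluster: a subset of rows and a subset of columns.
# After reordering rows and columns, these would form a contiguous block.
rows = [0, 2, 4]
cols = [0, 2, 5]

density = bicluster_density(toy, rows, cols)
overall = bicluster_density(toy, list(range(5)), list(range(6)))

print(f"bicluster density: {density:.3f}")
print(f"overall density:   {overall:.3f}")
```

A candidate whose density is far above the overall density of the matrix is the kind of block a binary biclustering method looks for; how such candidates are searched for and scored is specific to each algorithm.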
Similar resources
BiBinConvmean: A Novel Biclustering Algorithm for Binary Microarray Data
In this paper, we present a new algorithm called BiBinConvmean for biclustering of binary microarray data. It is a novel alternative for extracting biclusters from sparse binary datasets. Our algorithm is based on the Iterative Row and Column Clustering Combination (IRCCC) and Divide and Conquer (DC) approaches, K-means initialization and the CroBin evaluation function [6]. Applied on binary syntheti...
Sparse group factor analysis for biclustering of multiple data sources
MOTIVATION: Modelling methods that find structure in data are necessary with the current large volumes of genomic data, and there have been various efforts to find subsets of genes exhibiting consistent patterns over subsets of treatments. These biclustering techniques have focused on one data source, often gene expression data. We present a Bayesian approach for joint biclustering of multiple d...
Optimal Estimation and Completion of Matrices with Biclustering Structures
Biclustering structures in data matrices were first formalized in a seminal paper by John Hartigan [15] where one seeks to cluster cases and variables simultaneously. Such structures are also prevalent in block modeling of networks. In this paper, we develop a theory for the estimation and completion of matrices with biclustering structures, where the data is a partially observed and noise cont...
Context Specific and Differential Gene Co-expression Networks via Bayesian Biclustering
Identifying latent structure in high-dimensional genomic data is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-...
Biclustering Gene-Feature Matrices for Statistically Significant Patterns
Biclustering is an important problem that arises in diverse applications, including analysis of gene expression and drug interaction data. The problem can be formalized in various ways through different interpretation of data and associated optimization functions. We focus on the problem of finding unusually dense patterns in binary (0-1) matrices. This formulation is appropriate for analyzing ...
Journal title:
- Journal of Computational Biology: A Journal of Computational Molecular Cell Biology
Volume 15, Issue 10
Pages -
Publication date: 2008